Extend your service's observability from structured logs alone to metrics and traces. Use OpenTelemetry (OTel) as the unified SDK, protocol, and transport, and pair it with a Collector that forwards to whichever backend you prefer (Prometheus, Tempo/Jaeger, Datadog, Cloud Monitoring, ...). Finally, wire the three signals together: logs carry a trace_id, metrics and traces link to each other, and problems turn from "gut feeling" into "evidence".
In cloud-native environments you need all three signals at once: logs, metrics, and traces. OpenTelemetry standardizes them behind one SDK, one wire protocol (OTLP), and shared semantic conventions.
Two common topologies:

```
1) Direct export:  App (SDK) --OTLP--> backend (e.g. Tempo / a Prometheus adapter)
2) Via Collector:  App (SDK) --OTLP--> OTel Collector --export--> Tempo/Jaeger + Prometheus/OTLP + logs sink
```

The Collector buys you centralized retry, buffering, filtering, sampling, and fan-out to multiple destinations, plus much more deployment flexibility.
Declare the dependencies as an optional extra (e.g. under a `dev` or `obs` extra):

```toml
# pyproject.toml
[project.optional-dependencies]
obs = [
  "opentelemetry-sdk>=1.27",
  "opentelemetry-exporter-otlp>=1.27",
  "opentelemetry-instrumentation-fastapi>=0.49b0",
  "opentelemetry-instrumentation-httpx>=0.49b0",
  "opentelemetry-instrumentation-logging>=0.49b0",
  "opentelemetry-instrumentation-requests>=0.49b0",
  "opentelemetry-instrumentation-sqlalchemy>=0.49b0",
  "prometheus-client>=0.20",  # only if you need a /metrics endpoint (pull mode)
]

[tool.hatch.envs.obs]
features = ["obs"]
```
Configure the SDK through the standard OTel environment variables:

```bash
export OTEL_SERVICE_NAME=awesome-api
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
export OTEL_RESOURCE_ATTRIBUTES=service.version=1.2.3,service.namespace=payments,env=dev
# sampling ratio (0-1)
export OTEL_TRACES_SAMPLER=traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.2
```
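Why `traceidratio` rather than random sampling: the decision is derived from the trace ID itself, so every service that sees the same trace makes the same keep/drop decision and you never get half a trace. A simplified, stdlib-only sketch of the idea (the real implementation is `opentelemetry.sdk.trace.sampling.TraceIdRatioBased`; `should_sample` here is a hypothetical name):

```python
def should_sample(trace_id: int, ratio: float) -> bool:
    # Keep a trace iff the lower 64 bits of its trace ID fall below
    # ratio * 2**64 -- a pure function of (trace_id, ratio), hence
    # deterministic across every process that sees the same trace.
    bound = round(ratio * (1 << 64))
    return (trace_id & 0xFFFFFFFFFFFFFFFF) < bound
```

For uniformly random trace IDs, roughly `ratio` of them pass, and a downstream service sampling at the same ratio keeps exactly the same set.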
```python
# src/my_project/obs/telemetry.py
import os

from opentelemetry import trace, metrics
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter


def setup_otel() -> None:
    res = Resource.create({
        "service.name": os.getenv("OTEL_SERVICE_NAME", "awesome-api"),
        "service.version": os.getenv("SERVICE_VERSION", "0.0.0"),
        "service.namespace": os.getenv("SERVICE_NAMESPACE", "default"),
        "deployment.environment": os.getenv("ENV", "dev"),
    })
    # Traces
    tp = TracerProvider(resource=res)
    tp.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))  # OTLP gRPC, port 4317
    trace.set_tracer_provider(tp)
    # Metrics
    reader = PeriodicExportingMetricReader(OTLPMetricExporter())  # OTLP gRPC, port 4317
    mp = MeterProvider(resource=res, metric_readers=[reader])
    metrics.set_meter_provider(mp)
```
```python
# src/my_project/adapters/web/app.py
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

from my_project.obs.telemetry import setup_otel


def create_app() -> FastAPI:
    setup_otel()
    app = FastAPI(title="Awesome API", version="1.2.3")

    # Auto-instrument the framework and HTTP client libraries
    FastAPIInstrumentor.instrument_app(app)
    HTTPXClientInstrumentor().instrument()
    RequestsInstrumentor().instrument()
    # SQLAlchemyInstrumentor().instrument(engine=your_engine)  # if you use SQLAlchemy

    @app.get("/healthz")
    def healthz():
        return {"ok": True}

    return app


app = create_app()
```
Add manual spans where auto-instrumentation cannot see, i.e. inside your business logic:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)


def compute_quote(user_id: str, items: list[dict]) -> int:
    with tracer.start_as_current_span("compute_quote") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("items.count", len(items))
        # ... heavy logic ...
        price = 123
        span.add_event("quote_computed", {"price": price})
        return price
```
Worked example: an HTTP business success rate plus a latency histogram.
```python
import time

from opentelemetry import metrics

meter = metrics.get_meter(__name__)
orders_success = meter.create_counter("orders_success_total")
orders_failed = meter.create_counter("orders_failed_total")
latency = meter.create_histogram("orders_latency_ms", unit="ms")


def place_order(user_id: str, payload: dict) -> str:
    t0 = time.perf_counter()
    try:
        # ... business ...
        orders_success.add(1, {"route": "POST /v1/orders"})
        return "ok"
    except Exception:
        orders_failed.add(1, {"route": "POST /v1/orders"})
        raise
    finally:
        latency.record((time.perf_counter() - t0) * 1000, {"route": "POST /v1/orders"})
```
SLO primer:

- Apply RED (Requests, Errors, Duration) to every API.
- Apply USE (Utilization, Saturation, Errors) to resources (CPU, connection pools, queues).
- Always attach `route`, `status_class`, `env`, and `service.version` labels to your metrics; without them, alerting and slicing by dimension are meaningless.
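To keep those labels cheap and consistent, derive `status_class` from the HTTP status code and build the label dict in one place. A minimal sketch; the helper names `status_class` and `labels_for` are hypothetical, not part of any OTel API:

```python
import os


def status_class(status_code: int) -> str:
    """Collapse an HTTP status code into its class: 200 -> "2xx", 404 -> "4xx"."""
    return f"{status_code // 100}xx"


def labels_for(route: str, status_code: int) -> dict[str, str]:
    """Standard label set to attach to every metric data point."""
    return {
        "route": route,
        "status_class": status_class(status_code),
        "env": os.getenv("ENV", "dev"),
        "service.version": os.getenv("SERVICE_VERSION", "0.0.0"),
    }
```

Pass the result as the attributes argument of `add()`/`record()`, e.g. `orders_success.add(1, labels_for("POST /v1/orders", 201))`; keeping the label set low-cardinality (status class, not raw status; route template, not raw URL) is what keeps the metrics backend happy.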
Building on the JSON logs from earlier, inject the current trace context into every record:
```python
# src/my_project/logging_config.py
import logging

import structlog
from opentelemetry.trace import get_current_span


def _otel_ids(_, __, event_dict):
    # structlog processor: copy the active span's IDs into the event dict
    span = get_current_span()
    ctx = span.get_span_context()
    if ctx and ctx.is_valid:
        event_dict["trace_id"] = f"{ctx.trace_id:032x}"
        event_dict["span_id"] = f"{ctx.span_id:016x}"
    return event_dict


def setup_logging():
    logging.basicConfig(level=logging.INFO)
    structlog.configure(
        processors=[
            structlog.processors.add_log_level,
            _otel_ids,
            structlog.processors.TimeStamper(fmt="iso"),
            structlog.processors.JSONRenderer(),
        ]
    )
```
Now, when you find one error log, its trace_id takes you straight to the entire distributed trace.
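The `032x`/`016x` format specs above produce the canonical W3C Trace Context hex encoding: a 128-bit trace ID as 32 lowercase, zero-padded hex digits and a 64-bit span ID as 16. A stdlib-only illustration of that encoding:

```python
def format_trace_id(trace_id: int) -> str:
    # 128-bit trace ID -> 32 lowercase hex chars, zero-padded
    # (the trace-id field of a W3C traceparent header)
    return f"{trace_id:032x}"


def format_span_id(span_id: int) -> str:
    # 64-bit span ID -> 16 lowercase hex chars, zero-padded
    return f"{span_id:016x}"
```

These strings are exactly what you paste into Tempo/Jaeger search to jump from a log line to its trace, which is why the log processor must emit them zero-padded rather than as bare integers.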
`otel-collector.yaml` (receives OTLP and fans out to Tempo plus a Prometheus scrape endpoint; for illustration only):
```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlp:
    endpoint: tempo:4317  # Tempo or Jaeger both work
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:9464"  # Prometheus scrapes this (remote_write is also possible)

processors:
  batch: {}

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]
```
```yaml
# docker-compose.yml
version: "3.9"
services:
  otel-collector:
    image: otel/opentelemetry-collector:latest
    command: ["--config=/etc/otel.yaml"]
    volumes:
      - ./otel-collector.yaml:/etc/otel.yaml:ro
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP
      - "9464:9464"  # Prometheus metrics exporter
  tempo:
    image: grafana/tempo:latest
    ports: ["3200:3200"]  # Tempo query
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
```
`prometheus.yml` (scrapes the Collector's metrics endpoint):

```yaml
global: { scrape_interval: 15s }
scrape_configs:
  - job_name: "otelcol"
    static_configs: [{ targets: ["otel-collector:9464"] }]
```
Convenience scripts for Hatch:

```toml
[tool.hatch.envs.obs.scripts]
up = [
  "docker compose up -d otel-collector tempo prometheus grafana",
  "python -c \"print('otel up')\""
]
down = "docker compose down"
serve = "uvicorn my_project.adapters.web.app:app --host 0.0.0.0 --port 8000"
```
Production checklist:

- Set `OTEL_EXPORTER_OTLP_ENDPOINT` properly, and put `deployment.environment=prod` into the resource attributes.
- Have the `readinessProbe` check only lightweight dependencies; give `terminationGracePeriodSeconds` at least 10s so the BatchSpanProcessor can flush, and configure the server's `graceful-timeout` so spans are not lost on shutdown.
- Build P95/P99 alert thresholds on the `http.server.request.duration` histogram; page when `error_rate` exceeds your threshold.
- For downstream calls, watch `client.duration` and `client.error_rate`, plus cache hit rate and DB connection-pool saturation.

| Symptom | Likely cause | Fix |
|---|---|---|
| Some spans missing | Sampling ratio too low; batch never flushed | Temporarily set `OTEL_TRACES_SAMPLER=always_on`; call `force_flush()` before shutdown; tune `OTEL_BSP_SCHEDULE_DELAY` |
| Metrics not updating | Only Counters recorded, Histogram forgotten | Use Histograms for latency and sizes; check the Reader/Exporter export interval |
| Backend returns 429 or drops data | Collector has no batch/retry configured | Add the `batch` processor and retry settings to the Collector |
| Logs have no trace_id | Wrong setup order, or logging outside a span | Call `setup_otel()` before `setup_logging()`; write logs inside the request context |
| Nothing connects end to end | Proxy or firewall blocks 4317/4318 | Use HTTP on 4318, or run a Collector inside the network and let it do the egress |
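The flush-before-shutdown advice above can be packaged as a small hook. A sketch, assuming you keep references to the providers created in `setup_otel()` (the original keeps them local, so the parameters here are assumptions); both the SDK's `TracerProvider` and `MeterProvider` expose `force_flush()` and `shutdown()`:

```python
import atexit


def shutdown_telemetry(tracer_provider, meter_provider, timeout_ms: int = 5000) -> None:
    """Flush buffered spans/metrics, then shut the providers down.

    Works with any objects exposing force_flush()/shutdown(); flush first,
    so buffered data is exported before the exporters are torn down.
    """
    tracer_provider.force_flush(timeout_ms)
    meter_provider.force_flush(timeout_ms)
    tracer_provider.shutdown()
    meter_provider.shutdown()


def install_shutdown_hook(tracer_provider, meter_provider) -> None:
    # Run the flush on normal interpreter exit; a SIGTERM handler
    # can call shutdown_telemetry() directly as well.
    atexit.register(shutdown_telemetry, tracer_provider, meter_provider)
```

Combined with `terminationGracePeriodSeconds >= 10s`, this is what keeps the last batch of spans from disappearing on every deploy.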
At this point your service doesn't just run, it is visible: logs carry context, traces form a chain, metrics have distributions, and the Collector ships it all reliably. Online incidents shift from guessing to evidence: the next time latency spikes, you open the latency histogram and the traces for the affected route first, then decide whether to roll back or scale out, instead of chanting at the screen. Read this together with the earlier posts and the puzzle is complete.